Optimized Hardware Architecture and Method for ECC Point Doubling Using Jacobian Coordinates over Short Weierstrass Curves

ABSTRACT

An optimized hardware architecture and method introduce a simple arithmetic processor that allows efficient implementation of an Elliptic Curve Cryptography (ECC) point doubling algorithm for Jacobian coordinates. The optimized architecture additionally reduces the storage required for intermediate values.

BACKGROUND

Electronic devices are becoming a ubiquitous part of everyday life. The number of smartphones and personal tablet computers in use is rapidly growing. A side effect of the increasing use of smartphones and personal tablets is that the devices are increasingly used for storing confidential data such as personal and banking data. Protection of this data against theft is of paramount importance.

The field of cryptography offers protection tools for keeping this confidential data safe. Based on hard to solve mathematical problems, cryptography typically requires highly computationally intensive calculations that are the main barrier to wider application in cloud and ubiquitous computing (ubicomp). If cryptographic operations cannot be performed quickly enough, cryptography tools are typically not accepted for use on the Internet. In order to be transparent while still providing security and data integrity, cryptographic tools need to follow trends driven by the need for high speed and the low power consumption needed in mobile applications.

Public key algorithms are typically the most computationally intensive calculations in cryptography. For example, take the case of Elliptic Curve Cryptography (ECC), one of the most computationally efficient public key algorithms. The 256 bit version of ECC provides security that is equivalent to a 128 bit symmetric key. A 256 bit ECC public key should provide comparable security to a 3072 bit RSA public key. The fundamental operation of ECC is a point multiplication, which is an operation heavily based on modular multiplication: approximately 3500 modular multiplications of 256 bit integers are needed for performing one ECC 256 point multiplication. Higher security levels (larger bit integers) require even more computational effort.

Building an efficient implementation of ECC is typically non-trivial and involves multiple stages. FIG. 1 illustrates stages 101, 102 and 103 that are needed to realize the Elliptic Curve Digital Signature Algorithm (ECDSA), which is one of the applications of ECC. Stage 101 deals with finite field arithmetic that comprises modular addition, inversion and multiplication. Stage 102 deals with point addition and point doubling which comprises the Joint Sparse Form (JSF), Non-Adjacent Form (NAF), windowing and projective coordinates. Finally, stage 103 deals with the ECDSA and the acceptance or rejection of the digital signature.

Any elliptic curve can be written as a plane geometric curve defined by the equation of the form (assuming the characteristic of the coefficient field is not equal to 2 or 3):

y² = x³ + ax + b  (1)

that is non-singular; that is, it has no cusps or self-intersections and is known as the short Weierstrass form where a and b are integers. The case where a=−3 is typically used in several standards such as those published by NIST, SEC and ANSI which makes this the case of typical interest.

Many algorithms have been proposed in the literature for efficient implementation of the Point Addition (PADD) and Point Doubling (PDBL) operations. Many of these algorithms are optimized for software implementation. While these are typically efficient on certain platforms, the algorithms are typically not optimal once the underlying hardware can be tailored to the algorithm.

A PDBL algorithm for Jacobian coordinates has been described by Cohen, Miyaji and Ono in Proceedings of the International Conference on the Theory and Applications of Cryptography and Information Security; Advances in Cryptology, ASIACRYPT 1998, pages 51-65, Springer-Verlag, 1998. Jacobian coordinates are projective coordinates where each point is represented as three coordinates (X, Y, Z). Note the coordinates are all integers. PDBL algorithm 200 requires 4 modular multiplications, 4 modular squarings, 4 modular subtractions, one modular addition, one modular multiplication by 2 and one modular division by 2 and is shown in FIG. 2. In order to perform the PDBL, the algorithm further requires a minimum of 3 temporary registers, which for ECC 256 bit each need to be 256 bits in size. All operations are done in the finite field K over which the elliptic curve E is defined. The finite arithmetic field K is defined over the prime number p so that all arithmetic operations are performed modulo p. The identity element is the point at infinity.
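For illustration only, the relationship between Jacobian and affine coordinates can be sketched in Python. The sketch relies on the standard convention that a Jacobian point (X, Y, Z) with Z≠0 corresponds to the affine point (X/Z², Y/Z³). The toy prime p=97, the coefficient b=5, the sample point (1, 10) and all function names are assumptions chosen for readability and do not appear in the figures.

```python
# Toy parameters (assumptions for illustration): y^2 = x^3 - 3x + 5 over GF(97).
P_MOD = 97
A_COEFF, B_COEFF = -3, 5


def inv_mod(v, p=P_MOD):
    """Modular inverse via Fermat's little theorem (valid because p is prime)."""
    return pow(v % p, p - 2, p)


def to_jacobian(x, y, z=1, p=P_MOD):
    """Lift affine (x, y) to the Jacobian triple (x*z^2, y*z^3, z)."""
    return (x * z * z % p, y * z * z * z % p, z % p)


def to_affine(X, Y, Z, p=P_MOD):
    """Map a Jacobian triple with Z != 0 back to affine (X/Z^2, Y/Z^3)."""
    zi = inv_mod(Z, p)
    return (X * zi * zi % p, Y * zi * zi * zi % p)


def on_curve(x, y, p=P_MOD):
    """Check the short Weierstrass equation (1) for an affine point."""
    return (y * y - (x * x * x + A_COEFF * x + B_COEFF)) % p == 0
```

Any nonzero Z yields a valid representation of the same affine point, which is why intermediate results can be kept in Jacobian form and the costly modular inversion deferred to the very end.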

SUMMARY

An optimized hardware architecture and method reduces storage requirements and speeds up the execution of the ECC PDBL algorithm by requiring only two temporary storage registers and by introducing a simple arithmetic unit for performing modular addition, subtraction and multiplication and division by 2.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows stages 101, 102 and 103 that are needed to realize the Elliptic Curve Digital Signature Algorithm (ECDSA).

FIG. 2 shows a prior art point doubling algorithm.

FIG. 3 shows an embodiment in accordance with the invention.

FIG. 4 shows an embodiment in accordance with the invention.

FIG. 5 shows an embodiment in accordance with the invention.

FIG. 6 shows an embodiment in accordance with the invention.

FIG. 7 shows an embodiment in accordance with the invention.

DETAILED DESCRIPTION

PDBL algorithm 300 in accordance with the invention is shown in FIG. 3. PDBL algorithm 300 requires fewer steps and reduces the storage requirements compared to PDBL algorithm 200 for the same modular point doubling. PDBL algorithm 300 requires only two temporary storage registers, T₁ and T₂. PDBL algorithm 300 is implemented over an optimized hardware architecture shown in FIG. 6 and FIG. 7 and specifically designed to take advantage of PDBL algorithm 300.

As input in step 301, PDBL algorithm 300 shown in FIG. 3 takes point P=(X₁, Y₁, Z₁) in Jacobian coordinates. T₁ and T₂ are temporary storage variables. Note that all mathematical operations shown are in modular arithmetic and all coordinates are Jacobian. In step 302 of PDBL algorithm 300, if P=∞ (the identity element) the value ∞ is returned. In step 303, the coordinate Z₁ is squared (Z₁*Z₁) and subtracted from X₁ with the resulting value stored in temporary register T₂. In step 304, 3T₂*(2X₁−T₂) is calculated and the resulting value stored in temporary register T₂. In step 305, T₂ is squared and the result stored in X₃. In step 306, 2Y₁*Z₁ is calculated, the result stored in Z₃. In step 307, 2Y₁ is calculated and squared (2Y₁*2Y₁) with the result stored in Y₃. In step 308, X₃−2Y₃*X₁ is calculated and the result stored in X₃. In step 309, (Y₃*X₁−X₃) is calculated and multiplied by T₂ and the result is stored in T₁. Note that the quantity Y₃*X₁ was already calculated in step 308 so step 309 only requires a single modular multiplication (by T₂). In step 310, T₁−Y₃*Y₃/2 is calculated and the result is stored in Y₃. Finally, in step 311 the result of the point doubling of P is returned in Jacobian coordinates as (X₃, Y₃, Z₃).
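The step sequence described above can be sketched in software as follows. This is a minimal Python model of steps 301 to 311 as stated in the text, not the hardware embodiment. Using `None` for the point at infinity, the helper `affine_double` (the textbook affine doubling formula, included only as a cross-check) and the function names are assumptions made for illustration.

```python
def pdbl_300(P, p):
    """Software model of steps 301-311 for a curve with a = -3 over GF(p)."""
    if P is None:                       # step 302: None stands in for infinity
        return None
    X1, Y1, Z1 = P                      # step 301: input in Jacobian coordinates
    half = (p + 1) // 2                 # 1/2 mod p for an odd prime p
    T2 = (X1 - Z1 * Z1) % p             # step 303
    T2 = 3 * T2 * (2 * X1 - T2) % p     # step 304
    X3 = T2 * T2 % p                    # step 305
    Z3 = 2 * Y1 * Z1 % p                # step 306
    Y3 = (2 * Y1) * (2 * Y1) % p        # step 307
    YX = Y3 * X1 % p                    # product shared by steps 308 and 309
    X3 = (X3 - 2 * YX) % p              # step 308
    T1 = (YX - X3) * T2 % p             # step 309
    Y3 = (T1 - Y3 * Y3 * half) % p      # step 310
    return (X3, Y3, Z3)                 # step 311


def affine_double(x, y, p, a=-3):
    """Reference affine doubling (requires y != 0), used only as a cross-check."""
    lam = (3 * x * x + a) * pow(2 * y, p - 2, p) % p
    x3 = (lam * lam - 2 * x) % p
    return (x3, (lam * (x - x3) - y) % p)
```

Only the two temporaries `T1` and `T2` are needed besides the input and output coordinates; `YX` models the value that stays in the multiplier output between steps 308 and 309 rather than an additional register.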

The most computationally intensive operation in PDBL algorithm 300 in FIG. 3 is modular multiplication denoted by “*”. Because most of the steps described in PDBL algorithm 300 depend on the previous steps of the algorithm, it is typically most efficient to implement PDBL algorithm 300 in hardware using a single modular multiplier although more than one modular multiplier may be used in accordance with the invention which allows more than one modular multiplication to be performed in a step. Using only one modular multiplier restricts each step in PDBL algorithm 300 to having no more than one modular multiplication.

It is important to note that besides the modular multiplication steps performed in steps 303, 308 and 309 of PDBL algorithm 300, additional, comparatively simple operations are performed as well: modular subtraction and addition and modular multiplication and division by 2. Note that multiplication or division by a power of 2 in binary is merely a shift operation. In order to accelerate execution of PDBL algorithm 300 and eliminate the need for additional temporary registers, an embodiment in accordance with the invention of simple arithmetic unit (SAU) 400 with the inputs and outputs as shown in FIG. 4 is used.
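As a hedged illustration of how a one bit shift can realize multiplication and division by 2 modulo an odd prime p, consider the Python sketch below. The conditional subtraction after the left shift and the conditional addition of p before the right shift are common techniques for keeping the result in the range 0 to p−1; they are assumptions of this sketch and are not taken from the figures. The function names are likewise hypothetical.

```python
def mod_double(x, p):
    """2x mod p for 0 <= x < p: one left shift, then subtract p if needed."""
    r = x << 1
    return r - p if r >= p else r


def mod_halve(x, p):
    """x/2 mod p for odd p and 0 <= x < p.

    An odd x is first made even by adding p (which does not change its
    residue), so that the one bit right shift is exact.
    """
    return (x + p) >> 1 if x & 1 else x >> 1
```

For example, with p=97 the value 3 is odd, so halving shifts 3+97=100 right by one bit to give 50, and doubling 50 gives 100−97=3 again.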

FIG. 5 shows how steps 303, 304, 306, 307, 308, 309 and 310 are broken down to take advantage of SAU 400 which has inputs A, B and C with outputs D and E. Note that the input and output labels of SAU 400 correspond to the respective variable names in FIG. 5. Block 501 shows how step 303 of PDBL algorithm 300 is broken down using SAU 400 and involves setting inputs A=X₁ and B=Z₁² with output E=A−B. Block 502 shows how step 304 of PDBL algorithm 300 is broken down using SAU 400 and involves setting inputs A=X₁, B=T₂ and C=T₂ with outputs D=3C and E=2A−B. Outputs D and E are then multiplied together and the result is stored in temporary register T₂. Block 503 shows how step 306 of PDBL algorithm 300 is broken down using SAU 400 and involves setting A=Y₁ and B=0 with output E=2A−B. Output E is then multiplied by Z₁ and the result is stored in Z₃. Block 504 shows how step 307 of PDBL algorithm 300 is broken down using SAU 400 and involves setting inputs A=Y₁ and B=0 with output E=2A−B. Output E is then multiplied by itself and the result is stored in Y₃. Block 505 shows how step 308 of PDBL algorithm 300 is broken down using SAU 400 and involves setting inputs A=X₃ and B=X₁*Y₃ with output E=A−2B. Output E is stored in X₃. Block 506 shows how step 309 of PDBL algorithm 300 is broken down using SAU 400 and involves setting input A=X₁*Y₃ and B=X₃ with output E=A−B. Note that step 309 reuses the result of step 308 for X₁*Y₃ (stored in the output register of the multiplier). Output E is then multiplied by T₂ and the result is stored in temporary register T₁. Block 507 shows how step 310 of PDBL algorithm 300 is broken down using SAU 400 and involves setting inputs A=T₁ and B=Y₃² with output E=A−B/2. Output E is stored in Y₃. Note that “don't care” indicates the value is irrelevant to the calculation being performed in the respective steps.
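The breakdown into SAU operations and single multiplications can be sketched behaviorally as follows. This Python model assumes that step 304 multiplies output D by output E (matching the expression 3T₂*(2X₁−T₂) of step 304) and that the result of step 309 is written to T₁ (matching the description of step 309). The `Multiplier` class, the `mode` strings and the function names are hypothetical and serve only to make the schedule explicit.

```python
class Multiplier:
    """Stand-in for the single modular multiplier; counts its invocations."""

    def __init__(self, p):
        self.p = p
        self.count = 0

    def __call__(self, x, y):
        self.count += 1
        return x * y % self.p


def sau(a, b, c, mode, p):
    """Behavioral stand-in for SAU 400: returns (D, E) with D = 3C."""
    half = (p + 1) // 2                      # 1/2 mod p for odd p
    e = {"A-B": a - b, "2A-B": 2 * a - b,
         "A-2B": a - 2 * b, "A-B/2": a - b * half}[mode]
    return 3 * c % p, e % p


def pdbl_via_sau(X1, Y1, Z1, mul):
    """Run blocks 501-507 (plus step 305) with at most one multiply per step."""
    p = mul.p
    _, T2 = sau(X1, mul(Z1, Z1), 0, "A-B", p)      # block 501, step 303
    D, E = sau(X1, T2, T2, "2A-B", p)              # block 502, step 304
    T2 = mul(D, E)
    X3 = mul(T2, T2)                               # step 305, multiplier only
    _, E = sau(Y1, 0, 0, "2A-B", p)                # block 503, step 306
    Z3 = mul(E, Z1)
    _, E = sau(Y1, 0, 0, "2A-B", p)                # block 504, step 307
    Y3 = mul(E, E)
    prod = mul(Y3, X1)                             # kept in multiplier output
    _, X3 = sau(X3, prod, 0, "A-2B", p)            # block 505, step 308
    _, E = sau(prod, X3, 0, "A-B", p)              # block 506, step 309
    T1 = mul(E, T2)
    _, Y3 = sau(T1, mul(Y3, Y3), 0, "A-B/2", p)    # block 507, step 310
    return X3, Y3, Z3
```

Counting the calls to the multiplier in this sketch gives eight modular multiplications for one doubling, one in each of steps 303 to 310.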

FIG. 6 shows embodiment 600 in accordance with the invention comprising multi-cycle multiplier 610 with output register (not shown), SAU 400, multiplexer (MUX) 620 and MUX 630 with input registers X₁, Y₁, Z₁ (x₂ and y₂ are not used), output registers X₃, Y₃, Z₃ and temporary registers T₁ and T₂ that are all part of register memory 695. Note the individual register labels correspond to variable names in FIGS. 3 and 5. MUX 620 and MUX 630, as well as MUX 720 and MUX 725 (part of SAU 400, see FIG. 7), are controlled by the microprocessor (not shown) which schedules the steps of PDBL algorithm 300. As noted above, each step in PDBL algorithm 300 involves at most one modular multiplication by multi-cycle multiplier 610 (not counting multiplication or division by 2 which in binary representation is merely a shift operation).

SAU 400 shown in FIG. 7 comprises subtractor 710 and adder 722, logical one bit left shifter 715 (multiplication by 2), logical one bit right shifter 716 (division by 2), logical one bit left shifter 717 (multiplication by 2), logical one bit left shifter 718 (multiplication by 2), MUX 720 and MUX 725.

Input A goes to both input “0” of MUX 720 and logical one bit left shifter 715 on line 671. Logical one bit left shifter 715 multiplies input A by two and outputs 2A on line 771 to the “1” input of MUX 720. Output line 776 of MUX 720 provides the minuend input for subtractor 710. Input B goes to logical one bit right shifter 716, logical one bit left shifter 717 and input “1” of MUX 725 on line 672. Logical one bit right shifter 716 divides input B by two and outputs B/2 on line 772 to input “0” of MUX 725. Logical one bit left shifter 717 multiplies input B by two and outputs 2B on line 774 to input “2” of MUX 725. Output line 777 of MUX 725 connects to the subtrahend input of subtractor 710. Input C connects to adder 722 and to logical one bit left shifter 718 on line 673. Logical one bit left shifter 718 multiplies input C by two and outputs 2C to adder 722 on line 775. Subtractor 710 outputs E (see FIG. 4) on line 696. Adder 722 outputs D (=3C) on line 690.
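The signal routing just described can be modeled at the multiplexer level as in the Python sketch below. The select values follow the text: MUX 720 chooses A (“0”) or 2A (“1”), and MUX 725 chooses B/2 (“0”), B (“1”) or 2B (“2”). Reducing every result modulo p and realizing B/2 as a multiplication by the modular inverse of 2 are simplifying assumptions of this sketch; the function and parameter names are hypothetical.

```python
def sau_400(a, b, c, sel_720, sel_725, p):
    """Mux-level model of the SAU as described for FIG. 7; returns (D, E)."""
    half = (p + 1) // 2                        # 1/2 mod p, assumes odd p
    two_a = 2 * a                              # left shifter 715
    b_half = b * half                          # right shifter 716 (modular B/2)
    two_b = 2 * b                              # left shifter 717
    two_c = 2 * c                              # left shifter 718
    minuend = (a, two_a)[sel_720]              # MUX 720: "0" -> A, "1" -> 2A
    subtrahend = (b_half, b, two_b)[sel_725]   # MUX 725: "0", "1", "2"
    d = (c + two_c) % p                        # adder output D = 3C
    e = (minuend - subtrahend) % p             # subtractor 710 output E
    return d, e
```

With this model the four subtractor results used by the algorithm correspond to the select pairs (0, 1) for A−B, (1, 1) for 2A−B, (0, 2) for A−2B and (0, 0) for A−B/2.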

Multi-cycle multiplier 610 functions by multiplying the values on lines 635 and 640 together and outputting the result on line 650. Steps 301-302 of PDBL algorithm 300 are performed on the microprocessor (not shown) without using multi-cycle multiplier 610 and SAU 400.

Step 303 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides X₁ on line 665 to input “0” of MUX 620 with MUX 620 set to “0” and Z₁ is provided from register memory 695 on both lines 635 and 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes Z₁² which is output on line 650 to input “1” of MUX 630 with MUX 630 set to “1”. MUX 620 sends X₁ to input A of SAU 400 on line 671 and MUX 630 sends Z₁² to input B of SAU 400 on line 672. MUX 720 in SAU 400 is set to “0” and sends A from line 671 on line 776 to the minuend input of subtractor 710. MUX 725 in SAU 400 is set to “1” and sends B from line 672 on line 777 to the subtrahend input of subtractor 710. Subtractor 710 computes E (which is A−B=X₁−Z₁²), which is passed to register memory 695 on line 696 and stored in temporary register T₂.

Step 304 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides X₁ on line 665 to input “0” of MUX 620 and MUX 620 is set to “0”. MUX 620 sends X₁ to input A of SAU 400 on line 671. Register memory 695 provides T₂ on line 660 to input “0” of MUX 630 with MUX 630 set to “0”, so that MUX 630 sends T₂ to input B of SAU 400 on line 672, and register memory 695 also provides T₂ to input C of SAU 400 on line 673. MUX 720 in SAU 400 is set to “1” and MUX 720 sends 2A from line 771 on line 776 to the minuend input of subtractor 710. MUX 725 in SAU 400 is set to “1” and MUX 725 sends B from input line 672 on line 777 to the subtrahend input of subtractor 710. Input C (T₂) of SAU 400 on line 673 is sent to both logical one bit left shifter 718 and adder 722. The output 2C on line 775 from logical one bit left shifter 718 goes to adder 722. Adder 722 outputs D (which is 3C=3T₂) on line 690 and subtractor 710 outputs E (which is 2A−B=2X₁−T₂) on line 696 to register memory 695, which passes E and D on lines 635 and 640, respectively, to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes E*D and sends the result on line 650 to register memory 695 where the result is stored in temporary register T₂.

Step 305 utilizes multi-cycle multiplier 610. T₂ is provided from register memory 695 to both lines 635 and 640 to multi-cycle multiplier 610 which computes and outputs T₂² on line 650 to register memory 695 where the result is stored in X₃.

Step 306 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides Y₁ on line 665 to input “0” of MUX 620 and MUX 620 is set to “0”. MUX 620 sends Y₁ to input A of SAU 400 on line 671. Logical one bit left shifter 715 takes input A on line 671, multiplies input A by two and outputs 2A on line 771 to input “1” of MUX 720. MUX 720 in SAU 400 is set to “1” and MUX 720 sends 2A on line 776 to the minuend input of subtractor 710. Binary 0 is supplied on line 660 to input “0” of MUX 630 with MUX 630 set to “0”. MUX 630 sends binary 0 from line 660 to input B of SAU 400 on line 672. MUX 725 in SAU 400 is set to “1” and MUX 725 sends binary 0 on line 777 to the subtrahend input of subtractor 710. Subtractor 710 outputs E (which is 2A−B=2Y₁) on line 696 to register memory 695, which passes the value through on line 635 to multi-cycle multiplier 610, and register memory 695 provides Z₁ on line 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes E*Z₁ (2Y₁*Z₁) and sends the result on line 650 to register memory 695 where it is stored in Z₃.

Step 307 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides Y₁ on line 665 to input “0” of MUX 620 and MUX 620 is set to “0”. MUX 620 sends Y₁ to input A of SAU 400 on line 671. Logical one bit left shifter 715 takes input A on line 671, multiplies input A by two and outputs 2A on line 771 to input “1” of MUX 720. MUX 720 in SAU 400 is set to “1” and MUX 720 sends 2A on line 776 to the minuend input of subtractor 710. Binary 0 is supplied on line 660 to input “0” of MUX 630 with MUX 630 set to “0”. MUX 630 sends binary 0 from line 660 to input B of SAU 400 on line 672. MUX 725 in SAU 400 is set to “1” and MUX 725 sends binary 0 on line 777 to the subtrahend input of subtractor 710. Subtractor 710 computes 2A−B (which is 2Y₁) as E on line 696 to register memory 695 which passes E through both on line 635 and on line 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes E² (which is (2Y₁)²) and sends the result to register memory 695 on line 650 where it is stored in Y₃.

Step 308 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides X₃ on line 665 to input “0” of MUX 620 and MUX 620 is set to “0”. MUX 620 sends X₃ to input A of SAU 400 on line 671 which connects to input “0” of MUX 720 with MUX 720 set to “0”. MUX 720 sends A on line 776 to the minuend input of subtractor 710. Register memory 695 provides Y₃ on line 635 to multi-cycle multiplier 610 and provides X₁ on line 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes Y₃*X₁ and sends the result to input “1” of MUX 630 and MUX 630 is set to “1”. MUX 630 sends Y₃*X₁ to input B of SAU 400 on line 672. Logical one bit left shifter 717 takes input B on line 672, multiplies input B by two and outputs 2B (2Y₃*X₁) on line 774 to input “2” of MUX 725. MUX 725 is set to “2” and sends 2B on line 777 to the subtrahend input of subtractor 710. Subtractor 710 computes E (which is A−2B=X₃−2Y₃*X₁) on line 696 to register memory 695 where it is stored in X₃.

Step 309 utilizes both multi-cycle multiplier 610 and SAU 400. In step 308, Y₃*X₁ was computed by multi-cycle multiplier 610. Hence, Y₃*X₁ is still present in the output register (not shown) of multi-cycle multiplier 610 and in step 309 is sent on line 650 to input “1” of MUX 620 and MUX 620 is set to “1”. MUX 620 provides Y₃*X₁ to input A of SAU 400 on line 671 which connects to input “0” of MUX 720. MUX 720 in SAU 400 is set to “0” and MUX 720 sends A (which is Y₃*X₁) on line 776 to the minuend input of subtractor 710. Register memory 695 provides X₃ on line 660 to input “0” of MUX 630 and MUX 630 is set to “0”. MUX 630 sends X₃ to input B of SAU 400 on line 672 which connects to input “1” of MUX 725. MUX 725 is set to “1” and provides B on line 777 to the subtrahend input of subtractor 710. Subtractor 710 computes E (which is A−B=Y₃*X₁−X₃) which is sent on line 696 to register memory 695, which passes the value through on line 635 to multi-cycle multiplier 610, and register memory 695 provides T₂ on line 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes E*T₂ (which is (Y₃*X₁−X₃)*T₂) and sends the result on line 650 to register memory 695 where it is stored in temporary register T₁.

Step 310 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides T₁ on line 665 to input “0” of MUX 620 and MUX 620 is set to “0”. MUX 620 sends T₁ to input A of SAU 400 on line 671 which connects to input “0” of MUX 720. MUX 720 in SAU 400 is set to “0” and MUX 720 sends A (T₁) on line 776 to the minuend input of subtractor 710. Y₃ is provided from register memory 695 to both lines 635 and 640 to multi-cycle multiplier 610 which computes Y₃², which is output on line 650 to input “1” of MUX 630 with MUX 630 set to “1”. MUX 630 provides Y₃² on line 672 to input B of SAU 400. Logical one bit right shifter 716 takes input B on line 672, divides input B by two and outputs B/2 (Y₃²/2) to input “0” of MUX 725 and MUX 725 is set to “0”. MUX 725 sends B/2 on line 777 to the subtrahend input of subtractor 710. Subtractor 710 computes E (A−B/2=T₁−Y₃²/2) which is sent on line 696 to register memory 695 where it is stored in Y₃.

Step 311 is performed in the microprocessor and returns the result of PDBL algorithm 300 which is (X₃, Y₃, Z₃) for input (X₁, Y₁, Z₁).

CLAIMS

1. An apparatus for performing an elliptic curve cryptography point doubling operation using Jacobian coordinates comprising: a register memory for storing a point in Jacobian coordinates; a modular multiplier electrically coupled to the register memory; and a simple arithmetic processor electrically coupled to the register memory and the modular multiplier, wherein the simple arithmetic processor is configured to perform modular subtraction, modular multiplication by two and modular division by two in support of the point doubling operation comprising a plurality of steps in Jacobian coordinates.

2. The apparatus of claim 1 wherein the simple arithmetic processor comprises three logical one bit left shifters.

3. The apparatus of claim 1 wherein the simple arithmetic processor is configured to output 3C for an input of a variable C.

4. The apparatus of claim 1 wherein the point doubling operation is performed over a short Weierstrass curve of the form y²=x³+ax+b where a=−3.

5. The apparatus of claim 1 wherein the register memory is configured for two temporary storage variables, T₁ and T₂.

6. A mobile device comprising the apparatus of claim 1.

7. A smartcard comprising the apparatus of claim 1.

8. The mobile device of claim 6 wherein the mobile device is a smartphone.

9. The apparatus of claim 1 wherein the modular multiplier is configured to perform at most one modular multiplication for each one of the plurality of steps.

10. The apparatus of claim 1 wherein the simple arithmetic processor is configurable to output A−B/2 for an input of variables A and B.

11. A method for performing an elliptic curve cryptography point doubling operation using Jacobian coordinates comprising: accepting the input of a point in Jacobian coordinates into a computational device having a register memory, a modular multiplier and a simple arithmetic processor configured for modular subtraction, modular division by two and modular multiplication by two; enabling the computational device to execute a sequence of steps to perform the elliptic curve cryptography point doubling operation of the point wherein the modular multiplier performs at most one modular multiplication per step.

12. The method of claim 11 wherein the simple arithmetic processor is configurable to output A−B/2 for an input of variables A, B.

13. The method of claim 11 wherein the sequence of steps requires no more than two temporary variables.

14. The method of claim 11 wherein the computational device is part of a mobile device.

15. The method of claim 14 wherein the mobile device is a smartphone.

16. The method of claim 11 wherein the computational device is part of a smartcard.

17. The method of claim 11 further comprising enabling the computational device to output a result of the point doubling operation in Jacobian coordinates.